pBWT: Achieving Succinct Data Structures for Parameterized Pattern Matching and Related Problems

Authors

  • Arnab Ganguly
  • Rahul Shah
  • Sharma V. Thankachan
Abstract

The fields of succinct data structures and compressed text indexing have seen considerable progress over the last two decades. An important achievement, primarily using techniques based on the Burrows-Wheeler Transform (BWT), was obtaining the full functionality of the suffix tree in the optimal number of bits. A crucial property that allows the use of BWT for designing compressed indexes is order-preserving suffix links. Specifically, the relative order between two suffixes in the subtree of an internal node is the same as that of the suffixes obtained by truncating the first character of the two suffixes. Unfortunately, in many variants of the text-indexing problem, e.g., parameterized pattern matching, 2D pattern matching, and order-isomorphic pattern matching, this property does not hold. Consequently, the compressed indexes based on BWT do not directly apply. Furthermore, a compressed index for any of these variants has been elusive throughout the advancement of the field of succinct data structures. We achieve a positive breakthrough on one such problem, namely the Parameterized Pattern Matching problem. Let T be a text that contains n characters from an alphabet Σ, which is the union of two disjoint sets: Σs containing static characters (s-characters) and Σp containing parameterized characters (p-characters). A pattern P (also over Σ) matches an equal-length substring S of T iff the s-characters match exactly, and there exists a one-to-one function that renames the p-characters in S to those in P. The task is to find the starting positions (occurrences) of all such substrings S. The previous index [Baker, STOC 1993], known as the Parameterized Suffix Tree, requires Θ(n log n) bits of space, and can find all occ occurrences in time O(|P| log σ + occ), where σ = |Σ|.
We introduce an (n log σ + O(n))-bit index with O(|P| log σ + occ · log n log σ) query time. At the core lies a new BWT-like transform, which we call the Parameterized Burrows-Wheeler Transform (pBWT). The techniques are extended to obtain a succinct index for the Parameterized Dictionary Matching problem of Idury and Schäffer [CPM, 1994].

∗ Arnab Ganguly was partially supported by National Science Foundation (NSF) Grants CCF–1218904 and CCF–1527435, and a Louisiana State University (LSU) Dissertation Fellowship.
† School of EECS at LSU. Email: [email protected]
‡ School of EECS at LSU, and NSF. Email: [email protected], [email protected]
§ Department of CS at the University of Central Florida (UCF). Email: [email protected]
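
The matching rule defined in the abstract can be illustrated with a small sketch. It is not the pBWT index and makes no attempt at succinct space or fast queries: it is a naive scan that relies on the standard "prev" encoding associated with Baker's parameterized matching, in which each p-character is replaced by the distance to its previous occurrence (0 for a first occurrence) and s-characters are kept unchanged. Two equal-length strings p-match exactly when their encodings are equal. The function names `prev_encode`, `p_match`, and `p_occurrences`, and the choice to pass the static alphabet as a Python set, are assumptions made for this illustration.

```python
def prev_encode(s, static_chars):
    """Encode s: s-characters stay as-is, each p-character becomes the
    distance to its previous occurrence in s (0 if it has none)."""
    last_seen = {}  # p-character -> index of its latest occurrence
    encoded = []
    for i, ch in enumerate(s):
        if ch in static_chars:
            encoded.append(ch)  # static characters must match exactly
        else:
            encoded.append(i - last_seen[ch] if ch in last_seen else 0)
            last_seen[ch] = i
    return encoded


def p_match(s, p, static_chars):
    """True iff s and p have equal length and a one-to-one renaming of
    p-characters turns s into p (checked via equal prev encodings)."""
    return len(s) == len(p) and (
        prev_encode(s, static_chars) == prev_encode(p, static_chars)
    )


def p_occurrences(text, pattern, static_chars):
    """Naive scan: starting positions of all substrings that p-match pattern.
    Illustrative only; takes O(|T| * |P|) time and is not an index."""
    m = len(pattern)
    return [
        i
        for i in range(len(text) - m + 1)
        if p_match(text[i:i + m], pattern, static_chars)
    ]


# Hypothetical example: 'A' and 'B' are s-characters, lowercase letters are p-characters.
print(p_occurrences("AxyBAyxBAxxB", "AzwB", {"A", "B"}))  # [0, 4]
```

In this example the substrings "AxyB" and "AyxB" match the pattern "AzwB" under the renamings x→z, y→w and y→z, x→w respectively, while "AxxB" does not, because a single p-character cannot be renamed to two different ones.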

Similar resources

Parameterized Pattern Matching - Succinctly

The fields of succinct data structures and compressed text indexing have seen quite a bit of progress over the last 15 years. An important achievement, primarily using techniques based on the Burrows-Wheeler Transform (BWT), was obtaining the full functionality of the suffix tree in the optimal number of bits. A crucial property that allows the use of BWT for designing compressed indexes is order-p...

Parameterized matching on non-linear structures

The classical pattern matching paradigm is that of seeking occurrences of one string in another, where both strings are drawn from an alphabet set Σ. In the parameterized pattern matching model, a consistent renaming of symbols from Σ is allowed in a match. The parameterized matching paradigm has proven useful in problems in software engineering, computer vision, and other applications. In clas...

Entropy-Compressed Indexes for Multidimensional Pattern Matching

In this talk, we will discuss the challenges involved in developing multidimensional generalizations of compressed text indexing structures. These structures depend on some notion of Burrows-Wheeler transform (BWT) for multiple dimensions, though naive generalizations do not enable multidimensional pattern matching. We study the 2D case to possibly highlight combinatorial properties that do n...

Applications of Succinct Dynamic Compact Tries to Some String Problems

The dynamic compact trie is a fundamental data structure for a wide range of string processing problems. In this paper, we report our recent work on succinct dynamic compact tries that store a set of strings of total length n in O(n log σ) space, supporting pattern matching and insert/delete operations in O((|P|/α) f(n)) time, where P is a pattern string, α = Θ(log_σ n), and f(n) = O((log log n) ...

Optimal Trade-Offs for Succinct String Indexes

Let s be a string whose symbols are solely available through access(i), a read-only operation that probes s and returns the symbol at position i in s. Many compressed data structures for strings, trees, and graphs, require two kinds of queries on s: select(c, j), returning the position in s containing the jth occurrence of c, and rank(c, p), counting how many occurrences of c are found in the f...

Publication date: 2017